dataset = pd.read_parquet("dataset.parquet").drop_duplicates()
This is a true story of how I lost money using machine learning (ML) to bet on CS:GO. The idea originated with a friend, who gave me permission to share this story in public.
Check out the first post of the series here, which covers the theory and foundations necessary to understand what’s going on in this second post.
In this post, I will go over the actual implementation of the solution:
- CS:GO basics
- Data scraping
- Feature engineering
- TrueSkill
- Inferential vs predictive models
- Dataset
- Modelling
- Evaluation
- Backtesting
- Why I lost 1000 euros
Solution
CS:GO basics
Counter-Strike: Global Offensive (CS:GO) is a first-person shooter (FPS) multiplayer game. It can be played casually or competitively. When played competitively, it typically follows this format:
- Two teams of 5 play against each other: Terrorists vs Counter-Terrorists
- Best of 3 maps (sometimes best of 1 or 5)
- Maps are played up to 30 rounds
- Each round can be won by killing the other team or by planting or defusing the bomb
- Each player has a number of kills (K), deaths (D), assists (A) and average damage per round (ADR)
If you don’t know much about video games, don’t worry: you can treat CS:GO like any other team sport.

Web scraping
Data is the new oil.
As I explained in the first post, one of the reasons we chose CS:GO was data availability. Since we might have broken some terms and conditions, I won’t name our exact sources, but they were easily found online.
We collected both match data and betting odds. Note that match data is easy to find retroactively, but betting odds need to be collected in real-time, which limited our ability to run backtests (more on this later). Betting odds data is super valuable, and maybe a better way to make money would have been to collect it across different websites and sell it.
We collected 3 years’ worth of match data, covering over 30k matches. We managed to collect only 3 months of betting odds data, covering 1725 matches, with approximately 30 odds per match. Note that the odds fluctuate between the match announcement and the match start.
Match data contained information such as the teams playing, team composition, kills and deaths for each player, rounds won, the map to be played, and the final score (win-loss-tie).
To scrape the data we needed, we drove a headless browser with Selenium. Then, we parsed the resulting HTML with BeautifulSoup.
Feature engineering
Past behavior is the best predictor of future behavior.
With the match data, we created hundreds of features1. Most features were related to past performance, such as the percentage of times team 1 won on the map to be played or against team 2. If the teams have faced each other before, who won back then is an important predictor now. We also used game score features like KD difference and ADR on a team and individual basis.
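The key constraint for such past-performance features is that they may only use matches played before the one being predicted. The actual feature code is not shown in this post, so the following is only a minimal sketch of the idea; the tuple layout `(date, team1, team2, winner)` and the function name `prior_win_rates` are assumptions made for illustration.

```python
from collections import defaultdict


def prior_win_rates(matches):
    """For each match, compute each team's win rate over its earlier matches only.

    `matches` is a list of (date, team1, team2, winner) tuples (hypothetical layout).
    Teams with no history yet get None.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    features = []
    for date, team1, team2, winner in sorted(matches):
        # Read the counters before updating them, so the feature never
        # sees the outcome of the match it describes (no target leakage).
        rate1 = wins[team1] / played[team1] if played[team1] else None
        rate2 = wins[team2] / played[team2] if played[team2] else None
        features.append((date, team1, team2, rate1, rate2))
        for team in (team1, team2):
            played[team] += 1
        wins[winner] += 1
    return features


matches = [
    ("2019-01-01", "A", "B", "A"),
    ("2019-01-02", "A", "C", "C"),
    ("2019-01-03", "B", "A", "B"),
]
feats = prior_win_rates(matches)
for row in feats:
    print(row)
```

The same pattern (read the running statistic, then update it) extends to per-map win rates, head-to-head records, or rolling KD and ADR averages.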
Note that we couldn’t use the betting odds as features, even though the information there is invaluable2. The reason is simple, as previously explained: we didn’t have a backfill for historical betting odds. We could only use the odds that were available after we started to collect them, which was only (barely) enough for backtesting.
Also, we had a trump card, which ended up being the most important feature: TrueSkill.
TrueSkill
TrueSkill is a Bayesian skill rating system developed by Microsoft for multiplayer games, a Bayesian version of the Elo rating. It aims to estimate the “true skill” of each player or team based on their performance history.
TrueSkill uses a Gaussian distribution to represent the skill level of each player, and it updates these skill levels after each match using Bayesian3 updates. TrueSkill provides not just the skill estimate but also the uncertainty around each player’s skill, both of which can be used as features in an ML model.
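To make the "skill plus uncertainty" idea concrete, here is a small sketch of how a win probability can be derived from two Gaussian skill beliefs. It treats each side as a single rating and ignores draws; the function name is hypothetical, and `BETA = 25 / 6` is the conventional default on TrueSkill's usual scale (mean 25, standard deviation 25/3), not a value taken from our pipeline.

```python
import math

BETA = 25 / 6  # conventional default performance noise on the mu=25, sigma=25/3 scale


def win_probability(mu1, sigma1, mu2, sigma2, beta=BETA):
    """P(side 1 beats side 2), ignoring draws, for two Gaussian skill beliefs."""
    # The performance difference is Gaussian with mean mu1 - mu2 and
    # variance 2 * beta^2 + sigma1^2 + sigma2^2.
    denom = math.sqrt(2 * beta**2 + sigma1**2 + sigma2**2)
    z = (mu1 - mu2) / denom
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF


# Same skill gap, different uncertainty: more uncertainty pulls towards 50%
confident = win_probability(30, 1.0, 25, 1.0)
uncertain = win_probability(30, 8.0, 25, 8.0)
print(round(confident, 3), round(uncertain, 3))
```

Note how the same 5-point gap in mean skill yields a probability closer to 50% when both ratings are uncertain, which is why the uncertainty is a useful feature in its own right.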
Inferential vs predictive models
There are two cultures in the use of statistical modeling to reach conclusions from data. -Leo Breiman4
If we have TrueSkill, which predicts the win probability between two teams, why do we even need an ML model? TrueSkill is an inferential model, which attempts to explain the world through latent variables. Of course, a perfect model of the world would also make great predictions but, in practice, there is always a trade-off between explainability and predictive power. That is the biggest tension between statistics and ML.
ML models are typically less interpretable black boxes but much more powerful at making predictions. They can incorporate a wide range of features, including but not limited to those provided by TrueSkill, with the sole focus of optimising a loss function, which generally translates into better predictions.
Dataset
Here is the matches dataset with all the features and target together, including the TrueSkill features. I don’t provide the actual feature engineering code for the sake of brevity, as this post is long enough as it is.
Modelling
XGBoost is all you need. -Bojan Tunguz
The modelling done here is pretty standard tabular ML with a couple of notable exceptions:
- We remove ties, which represent roughly 1.5% of the dataset
- We do data augmentation by swapping team1 and team2 features and adding both rows to the training set
- We can do that as there is no “home advantage” in CS:GO like there is in football5
- When making predictions, we average the predictions across both scenarios
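The swap-and-average step from the list above can be illustrated with a toy example. This is not the model from this post: `biased_model` is a made-up scorer that unfairly favours whichever team is listed first, and `symmetric_proba` is a hypothetical helper name.

```python
def symmetric_proba(predict_team1_win, team1_feats, team2_feats):
    """Average the direct prediction with the mirrored one."""
    p_original = predict_team1_win(team1_feats, team2_feats)
    # With the teams swapped, the model outputs P(original team 2 wins),
    # so we take the complement before averaging.
    p_swapped = predict_team1_win(team2_feats, team1_feats)
    return (p_original + (1.0 - p_swapped)) / 2.0


def biased_model(a, b):
    """Toy asymmetric model: adds 0.1 to whoever is listed first."""
    raw = 0.5 + 0.1 + 0.05 * (a["rating"] - b["rating"])
    return min(max(raw, 0.0), 1.0)


strong, weak = {"rating": 3}, {"rating": 1}
p_strong = symmetric_proba(biased_model, strong, weak)
p_weak = symmetric_proba(biased_model, weak, strong)
print(p_strong, p_weak)
```

After averaging, the two probabilities sum to 1 regardless of which team is listed first, so the ordering bias of the toy model cancels out.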
Out-of-time train-test split
We use an out-of-time split instead of the more typical cross-validation. In pretty much any real-life application, a model is trained with past data and is then used to predict future unseen data. Your evaluation should reflect that, as you might be interested to know how your model performance degrades over time (which could be caused, for example, by concept drift)6.
import numpy as np
import pandas as pd

dataset = dataset.drop_duplicates()
dt_train = '2019-01-01'
dt_test = '2019-08-01'
dataset['target'] = (dataset['winner'] == 'team1').astype(bool)
# Keep matches from 2017 onwards, drop ties and exclude one specific match
dataset = dataset[
    (dataset['match_date'] >= '2017-01-01') &
    (dataset['winner'] != 'tie') &
    (dataset['match_id'] != 'https://www.hltv.org/matches/2332976/lucid-dream-vs-alpha-red-esl-pro-league-season-9-asia')
].reset_index(drop=True)

# Training period: everything before dt_train
mask_train = dataset['match_date'] < dt_train
dataset_train = dataset.loc[mask_train].reset_index(drop=True)

# Augmentation: a shuffled copy with the target flipped and team1/team2 features swapped
dataset_train2 = dataset_train.sample(frac=1).reset_index(drop=True)
dataset_train2['target'] = ~dataset_train2['target']
cols = []
# Swapping team features
for c in list(dataset_train.columns):
    if c.startswith('team1_'):
        cols.append(c.replace('team1_', 'team2_').replace('_team2', '_team1'))
    elif c.startswith('team2_'):
        cols.append(c.replace('team2_', 'team1_').replace('_team1', '_team2'))
    else:
        cols.append(c)
dataset_train2 = dataset_train2.rename(columns=dict(zip(dataset_train.columns, cols)))
dataset_train = dataset_train[cols]
dataset_train2 = dataset_train2[cols]
dataset_train = pd.concat([dataset_train, dataset_train2], axis=0, ignore_index=True).reset_index(drop=True)

# Hold out a random validation set from the (augmented) training rows
idxs = np.random.choice(len(dataset_train), replace=False, size=4000)
dataset_val = dataset_train.loc[idxs].drop_duplicates('match_id').reset_index(drop=True)
dataset_val = dataset_val.reset_index(drop=True)
index = np.arange(len(dataset_train))
mask = ~np.isin(index, idxs)
dataset_train = dataset_train.loc[mask].reset_index(drop=True)

# Test period: from dt_train (inclusive) to dt_test (exclusive)
mask_test = (
    (dataset['match_date'] >= dt_train) &
    (dataset['match_date'] < dt_test)
)
dataset_test = dataset.loc[mask_test].reset_index(drop=True)
dataset_train.shape, dataset_val.shape, dataset_test.shape
((38490, 267), (3798, 267), (4759, 267))
import plotly.express as px
dataset['match_date'] = pd.to_datetime(dataset['match_date'])
# Create weekly match counts
match_counts = dataset.groupby(dataset['match_date'].dt.to_period('W')).size().reset_index(name='count')
match_counts['match_date'] = match_counts['match_date'].dt.to_timestamp()
# Define color for each period
match_counts['period'] = 'Train'
match_counts.loc[match_counts['match_date'] >= pd.to_datetime(dt_train), 'period'] = 'Test'
# For validation, we'll consider it as part of the train set but with a different color
val_mask = dataset_val['match_date'].dt.to_period('W').value_counts().reset_index()
val_mask.columns = ['match_date', 'val_count']
match_counts = match_counts.merge(val_mask, on='match_date', how='left')
match_counts['val_count'] = match_counts['val_count'].fillna(0)
match_counts.loc[match_counts['val_count'] > 0, 'period'] = 'Validation'
# Create the plot
fig = px.line(match_counts, x='match_date', y='count', color='period',
title='Number of Matches Over Time (Weekly)',
labels={'count': 'Number of Matches', 'match_date': 'Date'},
color_discrete_map={'Train': 'blue', 'Validation': 'green', 'Test': 'red'})
# Add vertical lines for train/test split
fig.add_vline(x=dt_train, line_dash="dash", line_color="gray")
# Add annotation for the train/test split
fig.add_annotation(x=dt_train, y=1, yref="paper", showarrow=False,
text="Train/Test Split", textangle=-90, xanchor="right")
# Update layout for better readability
fig.update_layout(
legend_title_text='Dataset',
xaxis_title="Date",
yaxis_title="Number of Matches per Week",
)
fig.show()
Model: LightGBM
We use a standard off-the-shelf LightGBM binary classifier. There are many advantages to using LightGBM or XGBoost for tabular data problems (either choice is fine!):
- Handles missing values natively
- Handles categorical features natively
- Early stopping to optimize the number of estimators
- Blazing fast and scalable
- Multiple loss function options, including using a custom one
- For binary classification, the default is the log loss (a proper scoring rule, which should lead to well-calibrated probabilities)
- You can use SHAP for feature importance and explanations
For more information on how to unlock the power of LightGBM, watch my PyData London 2022 presentation.
from typing import Any, Dict

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier, early_stopping, log_evaluation


class CSGOPredictor:
    """
    A predictor class for CS:GO match outcomes using LightGBM.
    """

    def __init__(self, model_params: Dict[str, Any]):
        """
        Initialize the CSGOPredictor.
        Args:
            model_params (Dict[str, Any]): Parameters for the LightGBM model.
        """
        self.model_params = model_params
        self.lgb = None  # Will be initialized in the fit method

    def fit(self, x_train: pd.DataFrame, y_train: np.ndarray,
            x_val: pd.DataFrame, y_val: np.ndarray) -> 'CSGOPredictor':
        """
        Fit the LightGBM model on the training data.
        Args:
            x_train (pd.DataFrame): Training features.
            y_train (np.ndarray): Training labels.
            x_val (pd.DataFrame): Validation features.
            y_val (np.ndarray): Validation labels.
        Returns:
            CSGOPredictor: The fitted predictor object.
        """
        self.lgb = LGBMClassifier(**self.model_params)
        self.lgb.fit(
            x_train, y_train,
            eval_set=[(x_train, y_train), (x_val, y_val)],
            eval_names=['training', 'validation'],
            callbacks=[
                early_stopping(stopping_rounds=25),
                log_evaluation(period=50),  # Log every 50 iterations
            ]
        )
        return self

    def predict_proba(self, x: pd.DataFrame) -> np.ndarray:
        """
        Predict probabilities for match outcomes.
        This method performs predictions twice with swapped team features and averages the results.
        Args:
            x (pd.DataFrame): Input features for prediction.
        Returns:
            np.ndarray: Predicted probabilities for each class.
        """
        # Original predictions
        original = self.lgb.predict_proba(x)
        # Create a copy of the input data for feature swapping
        x_inv = x.copy()
        # Identify team1 and team2 columns
        team1_cols = [i for i in x_inv.columns if i.startswith('team1')]
        team2_cols = [i for i in x_inv.columns if i.startswith('team2')]
        # Swap team1 and team2 features
        x_inv = x_inv.rename(dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)), axis=1)
        x_inv = x_inv.reindex(columns=x.columns)
        # Predictions with swapped features
        inv = self.lgb.predict_proba(x_inv)
        # Swap the probabilities for team1 and team2
        inv[:, 0], inv[:, 1] = inv[:, 1], inv[:, 0].copy()
        # Average the original and swapped predictions
        return (original + inv) / 2.0

    def predict(self, x: pd.DataFrame) -> np.ndarray:
        """
        Predict the class labels for the input data.
        Args:
            x (pd.DataFrame): Input features for prediction.
        Returns:
            np.ndarray: Predicted class labels.
        """
        return self.predict_proba(x).argmax(axis=1)


model_params = {
    'n_estimators': 10_000,  # With early stopping, we will use many fewer trees than that
    'learning_rate': 0.05
}
drop_cols = ['winner', 'match_date', 'match_id', 'event_id', 'team1_id', 'team2_id', 'target']
x_train = dataset_train.drop(columns=drop_cols)
y_train = dataset_train['target']
features = list(x_train.columns)
x_val = dataset_val[features]
y_val = dataset_val['target']
x_test = dataset_test[features]
y_test = dataset_test['target']
model = CSGOPredictor(model_params).fit(x_train, y_train, x_val, y_val)
Training until validation scores don't improve for 25 rounds
[50] training's binary_logloss: 0.563521 validation's binary_logloss: 0.586438
[100] training's binary_logloss: 0.536487 validation's binary_logloss: 0.58016
[150] training's binary_logloss: 0.517125 validation's binary_logloss: 0.578305
Early stopping, best iteration is:
[145] training's binary_logloss: 0.51895 validation's binary_logloss: 0.578226
Feature importance
Here is the “beeswarm” view of SHAP values. It shows not just the importance but also how each feature influences the prediction logits7. You can also apply SHAP to individual samples to understand what features caused their prediction logits.
import shap
explainer = shap.Explainer(model.lgb)
shap_values = explainer(x_test)
shap.plots.beeswarm(shap_values, max_display=20)
Unsurprisingly, the TrueSkill win probability features are the most important ones. In a sense, this can be seen as a form of stacking, since TrueSkill is another model. Other important features relate to the team’s past performance, like KD ratio and ADR.
Are ~250 features really necessary? Probably not, especially with just 30k samples8. We didn’t do any feature selection, but I’d do permutation importance and adversarial validation on a time split if I had more time on my hands9.
Evaluation
We evaluate using the following metrics:
- Accuracy: how many bets you expect to get right
- AUC10: how well you rank-order the winners/losers
- Brier score: a metric that takes both calibration and accuracy into account
I also plot the calibration curves for the training and test sets.
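To see why the three metrics complement each other, here is a small pure-Python sketch on made-up numbers (the labels and probabilities below are invented for illustration; the real evaluation uses scikit-learn). Both toy forecasters pick the right winner every time, so accuracy and AUC cannot tell them apart, while the Brier score can.

```python
def accuracy(y_true, p, threshold=0.5):
    """Share of correct calls when predicting a win above the threshold."""
    return sum((pi > threshold) == bool(yi) for yi, pi in zip(y_true, p)) / len(y_true)


def brier_score(y_true, p):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)


def auc(y_true, p):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y = [1, 0, 1, 0]
p_sharp = [0.9, 0.1, 0.8, 0.3]    # confident forecaster
p_timid = [0.6, 0.4, 0.55, 0.45]  # same ranking, probabilities closer to 0.5

for name, p in [("sharp", p_sharp), ("timid", p_timid)]:
    print(name, accuracy(y, p), auc(y, p), round(brier_score(y, p), 5))
```

For betting, the probabilities themselves feed the decision rule, which is why a metric that is sensitive to them matters in addition to accuracy and AUC.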
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score


def calculate_metrics(X, y, model):
    # Probability of the positive class (team 1 wins) and the hard class prediction
    y_pred_proba = model.predict_proba(X)[:, 1]
    y_pred = model.predict(X)
    return {
        'Accuracy': accuracy_score(y, y_pred),
        'AUC': roc_auc_score(y, y_pred_proba),
        'Brier_score': brier_score_loss(y, y_pred_proba)
    }


metrics_train = calculate_metrics(x_train, y_train, model)
metrics_val = calculate_metrics(x_val, y_val, model)
metrics_test = calculate_metrics(x_test, y_test, model)
metrics_df = pd.DataFrame([metrics_train, metrics_val, metrics_test],
                          index=['Training', 'Validation', 'Test'])
metrics_df
| | Accuracy | AUC | Brier_score |
|---|---|---|---|
| Training | 0.738555 | 0.824180 | 0.173662 |
| Validation | 0.718536 | 0.794228 | 0.185571 |
| Test | 0.700567 | 0.772830 | 0.191394 |
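import plotly.graph_objects as go
from sklearn.calibration import calibration_curve


def plot_calibration_curve(y_true, y_pred_proba, set_name, fig, color):
    # calibration_curve returns (fraction of positives, mean predicted value), in that order
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_pred_proba, n_bins=10)
    fig.add_trace(go.Scatter(
        x=mean_predicted_value, y=fraction_of_positives,
        mode='lines+markers', name=f'{set_name} set',
        line=dict(color=color)
    ))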
# Create a new figure for the calibration plot
calibration_fig = go.Figure()
# Add the perfectly calibrated line
calibration_fig.add_trace(go.Scatter(
x=[0, 1], y=[0, 1],
mode='lines', name='Perfectly calibrated',
line=dict(dash='dot')
))
# Plot calibration curve for the training set
plot_calibration_curve(y_train, model.predict_proba(x_train)[:, 1], 'Training', calibration_fig, 'blue')
# Plot calibration curve for the test set
plot_calibration_curve(y_test, model.predict_proba(x_test)[:, 1], 'Test', calibration_fig, 'red')
# Set layout properties for the calibration plot
calibration_fig.update_layout(
title="Calibration plot",
xaxis_title="Mean predicted value",
yaxis_title="Fraction of positives",
xaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
yaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
showlegend=True
)
calibration_fig.show()
The model seems well calibrated (slightly more so on the test set than on the training set, a welcome surprise), which makes it useful for betting. Recall from the previous post that our betting decision rule is based on the probability of team 1 or team 2 winning. If you use a probability for decision making, it generally needs to be calibrated.
If the model weren’t well calibrated, we could fit isotonic regression on a validation set to fix that. There are other options for post-hoc model calibration, like Platt scaling, but isotonic regression is usually a good fit for tree-based models, provided the validation set is large enough.
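We did not need this step, but as a rough sketch of what isotonic calibration does, the block below fits a non-decreasing step function from raw scores to observed outcomes using the pool-adjacent-violators algorithm. In practice you would more likely reach for scikit-learn's `IsotonicRegression`; this pure-Python version, its function names, and the toy data are assumptions made purely for illustration.

```python
def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators: returns (upper_score, calibrated_value) blocks,
    sorted by score, with non-decreasing calibrated values."""
    blocks = []  # each block: [sum of outcomes, count, largest score in block]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(p, fitted):
    """Step-function lookup: value of the first block whose upper bound covers p."""
    for upper, value in fitted:
        if p <= upper:
            return value
    return fitted[-1][1]


# Toy validation set: raw model scores and whether team 1 actually won
val_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
val_outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
fitted = fit_isotonic(val_scores, val_outcomes)
print(fitted)
print(calibrate(0.5, fitted))
```

The fitted mapping must be learned on data the model was not trained on (here, a validation set) and then applied to the scores used for betting decisions.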
def auc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])
    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)
    # Initialize a dictionary to store AUC for each week
    weekly_auc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            weekly_auc[week_start_date] = auc
    return pd.Series(weekly_auc)


def acc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])
    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)
    # Initialize a dictionary to store accuracy for each week
    weekly_acc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            acc = accuracy_score(y, model.predict(X))
            weekly_acc[week_start_date] = acc
    return pd.Series(weekly_acc)
# Calculate weekly AUC for training and test sets
weekly_auc_train = auc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_auc_test = auc_over_time(dataset_test, model, 'match_date', 'target', features)
# Plotting the AUC over time using Plotly
trace0 = go.Scatter(
x=weekly_auc_train.index,
y=weekly_auc_train.values,
mode='lines+markers',
name='Training Set',
line=dict(color='blue')
)
trace1 = go.Scatter(
x=weekly_auc_test.index,
y=weekly_auc_test.values,
mode='lines+markers',
name='Test Set',
line=dict(color='red')
)
layout = go.Layout(
title='AUC Over Time',
xaxis=dict(title='Week Start Date'),
yaxis=dict(title='AUC'),
showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
annotation_text="Random prediction", annotation_position="bottom right")
avg_train_auc = weekly_auc_train.mean()
avg_test_auc = weekly_auc_test.mean()
# Training set average line for the training period
fig.add_shape(type='line',
x0=weekly_auc_train.index.min(), y0=avg_train_auc,
x1=weekly_auc_train.index.max(), y1=avg_train_auc,
line=dict(dash='dash', color='blue', width=2),
xref='x', yref='y')
# Test set average line for the test period
fig.add_shape(type='line',
x0=weekly_auc_test.index.min(), y0=avg_test_auc,
x1=weekly_auc_test.index.max(), y1=avg_test_auc,
line=dict(dash='dash', color='red', width=2),
xref='x', yref='y')
# Add annotations for the averages
fig.add_annotation(x=weekly_auc_train.index.max(), y=avg_train_auc,
text=f"Train Avg: {avg_train_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_auc_test.index.max(), y=avg_test_auc,
text=f"Test Avg: {avg_test_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.show()
There is a train-test performance gap, which implies some overfitting, but that is not a big concern per se. What we really care about is that the out-of-time performance is good enough, which the backtest below evaluates. Some overfitting is normal with gradient-boosted tree models, and their generalization performance is typically still better than that of other models like logistic regression or random forests (I will leave the model comparison as an exercise for the reader).
Also, note that there is a big drop in the last 3 weeks of the test dataset. That is exactly when I lost most of the money! There was some kind of drift or event in that period that made the model perform much worse. That also suggests we should not let the model go for more than 6 months without retraining. Unfortunately, when we first started to place the bets, we didn’t have that test set yet.
# Calculate weekly accuracy for training and test sets
weekly_acc_train = acc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_acc_test = acc_over_time(dataset_test, model, 'match_date', 'target', features)
# Plotting the accuracy over time using Plotly
trace0 = go.Scatter(
x=weekly_acc_train.index,
y=weekly_acc_train.values,
mode='lines+markers',
name='Training Set',
line=dict(color='blue')
)
trace1 = go.Scatter(
x=weekly_acc_test.index,
y=weekly_acc_test.values,
mode='lines+markers',
name='Test Set',
line=dict(color='red')
)
layout = go.Layout(
title='Accuracy Over Time',
xaxis=dict(title='Week Start Date'),
yaxis=dict(title='Accuracy'),
showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
annotation_text="Random prediction", annotation_position="bottom right")
avg_train_acc = weekly_acc_train.mean()
avg_test_acc = weekly_acc_test.mean()
# Training set average line for the training period
fig.add_shape(type='line',
x0=weekly_acc_train.index.min(), y0=avg_train_acc,
x1=weekly_acc_train.index.max(), y1=avg_train_acc,
line=dict(dash='dash', color='blue', width=2),
xref='x', yref='y')
# Test set average line for the test period
fig.add_shape(type='line',
x0=weekly_acc_test.index.min(), y0=avg_test_acc,
x1=weekly_acc_test.index.max(), y1=avg_test_acc,
line=dict(dash='dash', color='red', width=2),
xref='x', yref='y')
# Add annotations for the averages
fig.add_annotation(x=weekly_acc_train.index.max(), y=avg_train_acc,
text=f"Train Avg: {avg_train_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_acc_test.index.max(), y=avg_test_acc,
text=f"Test Avg: {avg_test_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.show()
The accuracy plot is similar to the AUC plot in almost all aspects. Note that we do much better than random guessing, but that is not a good baseline here. A much better baseline would be the accuracy calculated with the probabilities implied by the betting odds.
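We did not compute that baseline, but a rough sketch of it could look like the following. It converts decimal odds into implied probabilities, removes the bookmaker's margin by simple proportional normalisation (one choice among several), and scores the bookmaker's favourite against the actual winner. The function names and the example odds are made up for illustration.

```python
def implied_probabilities(team1_odds, team2_odds):
    """Turn decimal odds into probabilities that sum to 1, plus the bookmaker margin."""
    raw1, raw2 = 1 / team1_odds, 1 / team2_odds
    total = raw1 + raw2  # above 1 whenever the bookmaker takes a margin
    return raw1 / total, raw2 / total, total - 1


def bookmaker_accuracy(rows):
    """rows: list of (team1_odds, team2_odds, winner), winner in {'team1', 'team2'}."""
    hits = 0
    for team1_odds, team2_odds, winner in rows:
        p1, p2, _ = implied_probabilities(team1_odds, team2_odds)
        favourite = 'team1' if p1 >= p2 else 'team2'
        hits += favourite == winner
    return hits / len(rows)


p1, p2, margin = implied_probabilities(1.5, 2.5)
print(round(p1, 3), round(p2, 3), round(margin, 3))

toy_rows = [
    (1.5, 2.5, 'team1'),
    (1.5, 2.5, 'team2'),
    (3.0, 1.3, 'team2'),
    (1.9, 1.9, 'team1'),
]
print(bookmaker_accuracy(toy_rows))
```

If the model's accuracy on the same matches is not clearly above this number, its apparent skill mostly reproduces what the odds already encode.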
Backtesting
Past performance is no guarantee of future results.
Backtesting is replaying the past with your model decisions. One example of backtesting is the following:
1. Train the model with data up to a certain date
2. Sample betting odds for the next matches
3. Make bets for those next matches according to your betting strategy
4. Repeat 1-3 until you cover all the test data
5. Evaluate ML metrics (e.g. AUC) and business metrics (e.g. ROI) on your bets
Backtesting allows us to assess our financial performance, which matters a lot more than ML metrics. For example, is an AUC of 0.77 good or bad? That is hard to tell in general, while a ROI of 1.1 is something we can understand and compare to other strategies (including leaving your money in the bank to earn risk-free interest).
Here, we only assess the ROI of the bets, not other financial metrics like the Sharpe ratio or max drawdown.
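For readers who want to go further, here is a minimal sketch of those two extra metrics computed on a sequence of per-bet profits. We did not run this analysis; the profit numbers are invented, and the "Sharpe" below is a simplified per-bet ratio (mean over standard deviation, no risk-free rate, no annualisation).

```python
import statistics


def max_drawdown(profits):
    """Largest peak-to-trough drop of the cumulative profit curve."""
    cumulative = peak = worst = 0.0
    for profit in profits:
        cumulative += profit
        peak = max(peak, cumulative)
        worst = max(worst, peak - cumulative)
    return worst


def per_bet_sharpe(profits):
    """Simplified Sharpe-like ratio: mean profit per bet over its standard deviation."""
    return statistics.mean(profits) / statistics.stdev(profits)


# Profit of six 1-unit bets: a win at decimal odds 1.8 pays +0.8, a loss costs -1
toy_profits = [0.8, -1, -1, 0.9, -1, 1.2]
print(round(max_drawdown(toy_profits), 2), round(per_bet_sharpe(toy_profits), 3))
```

Two strategies with the same ROI can have very different drawdowns, which matters a lot when the bankroll is small.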
For simplicity, we just train the model once and keep it fixed for all future bets, which makes it a more conservative backtest.
First, let’s load the dataset of matches with betting odds:
dataset_with_odds = pd.read_parquet("match_predictions_with_odds.parquet")
dataset_with_odds = dataset_with_odds[["match_id", "team1_odds", "team2_odds"]]
dataset_with_odds = dataset_with_odds.merge(dataset_test, on="match_id")
dataset_with_odds['match_date'] = pd.to_datetime(dataset_with_odds['match_date'])
dataset_with_odds = dataset_with_odds.sort_values(by='match_date')
dataset_with_odds.shape
(1113837, 269)
Now, let’s simulate our betting strategy:
- For each match, sample just one betting odd at random
- Only bet if winning probability is over 50% AND
- Only bet if the probability of winning is greater than the implied probability by the odds plus a delta of 1%
- The bet can either be a fixed amount or determined by the Kelly criterion (here, for simplicity, I only show fixed betting – see previous blog post for a discussion on the Kelly criterion and some variants)
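As a worked example of the rule in the list above, the sketch below applies it to a few made-up probability and odds pairs, using the 50% threshold and the 1% delta. The helper names are hypothetical, and the expected profit only holds if the model probability is actually right.

```python
MIN_PROBA = 0.5
MIN_DELTA_PROBA = 0.01


def should_bet(model_proba, decimal_odds, min_proba=MIN_PROBA, min_delta=MIN_DELTA_PROBA):
    """Bet only on the model's favourite, and only with a margin over the implied probability."""
    implied = 1 / decimal_odds
    return model_proba > min_proba and model_proba > implied + min_delta


def expected_profit_per_unit(model_proba, decimal_odds):
    """Expected profit of a 1-unit stake, assuming the model probability is correct."""
    return model_proba * decimal_odds - 1


# Model says 60%; odds of 1.80 imply about 55.6%, so there is an edge
print(should_bet(0.60, 1.80), round(expected_profit_per_unit(0.60, 1.80), 2))
# Same model probability, but odds of 1.60 imply 62.5%: no edge, no bet
print(should_bet(0.60, 1.60))
# Underdog with an apparent edge (45% vs 33.3% implied) is still skipped
print(should_bet(0.45, 3.00))
```

The third case shows the effect of the 50% threshold: the rule never backs an underdog, even when the model sees value there.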
The first premise sounds odd: shouldn’t we pick the best possible betting odd? Not really, for two real-life reasons:
1. For risk management, you don’t want to bet multiple times on the same match
2. You might not be able to bet when you want for multiple reasons (e.g. you are asleep)
There was some trial and error involved in designing our betting strategy, and I’m sure there is room for improvement. The delta of 1% is our safety margin against model error, and we found it with a grid search. It’s a parameter you can play with in the simulation below:
# Constants
MIN_PROBA = 0.5
MIN_DELTA_PROBA = 0.01
N_SIMS = 200
all_samples_data = []
for _ in range(N_SIMS):
    # Sample one row per match
    df = (dataset_with_odds.groupby('match_id')
          .apply(lambda x: x.sample(1))
          .reset_index(drop=True))
    # Predict probabilities
    predict_proba = model.predict_proba(df[features])
    df['team1_proba'] = predict_proba[:, 1]
    df['team2_proba'] = predict_proba[:, 0]
    # Calculate implied probabilities from odds
    df['team1_implied_prob'] = 1 / df['team1_odds']
    df['team2_implied_prob'] = 1 / df['team2_odds']
    # Determine whether to bet based on probabilities and odds
    df['team1_bet'] = (df.team1_proba > MIN_PROBA) & (df.team1_proba > (df.team1_implied_prob + MIN_DELTA_PROBA))
    df['team2_bet'] = (df.team2_proba > MIN_PROBA) & (df.team2_proba > (df.team2_implied_prob + MIN_DELTA_PROBA))
    # Calculate returns
    df['team1_returns'] = np.where(df.team1_bet & (df.winner == 'team1'), df['team1_odds'], 0.0)
    df['team2_returns'] = np.where(df.team2_bet & (df.winner == 'team2'), df['team2_odds'], 0.0)
    # Calculate profit/loss
    df['loss'] = df['team1_bet'].astype(int) + df['team2_bet'].astype(int)
    df['revenue'] = df['team1_returns'] + df['team2_returns']
    df['profit'] = df['revenue'] - df['loss']
    all_samples_data.append(df)
# Combine all sample data and prepare for analysis
all_samples_df = pd.concat(all_samples_data).reset_index(drop=True)
all_samples_df['match_date'] = pd.to_datetime(all_samples_df['match_date'])
all_samples_df.sort_values(by='match_date', inplace=True)
# Calculate cumulative profit
all_samples_df['cumulative_profit'] = all_samples_df.groupby('match_date')['profit'].cumsum()
# Calculate daily profit and cumulative profit
daily_profit_sum = (all_samples_df.groupby('match_date')['profit']
.sum()
.reset_index())
daily_profit_sum['cumulative_profit'] = daily_profit_sum['profit'].cumsum() / N_SIMS
# Calculate total profits and ROI
total_profits = all_samples_df['profit'].sum()
total_bets = all_samples_df['loss'].sum() # Assumes 'loss' represents the number of bets
roi = total_profits / total_bets if total_bets > 0 else 0
# Calculate annualized ROI
min_date = all_samples_df['match_date'].min()
max_date = all_samples_df['match_date'].max()
duration_years = (max_date - min_date) / pd.Timedelta(days=365.25)
annualized_roi = (roi + 1) ** (1 / duration_years) - 1 if duration_years > 0 else 0
print(f"Backtest ROI: {round(roi*100)}%")
print(f"Annualized ROI: {round(annualized_roi*100)}%")
Backtest ROI: 10%
Annualized ROI: 64%
The ROI after 2 months is 10%, which annualized would be 64%, not bad at all! For reference, the risk-free interest rate in the US today is around 5% per year, while the average S&P 500 returns are roughly 10% a year.
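The annualization is plain compounding: raise the growth factor to the number of backtest periods that fit in a year. As a quick sanity check, the sketch below uses a window of 70 days, which is an assumption chosen because it roughly reproduces the printed figure, not a number read from the data.

```python
def annualize(roi, days):
    """Compound a return earned over `days` days to a yearly rate."""
    return (1 + roi) ** (365.25 / days) - 1


print(f"{annualize(0.10, 70):.0%}")
```

Keep in mind that this extrapolation assumes the same edge persists for a whole year and that all profits are reinvested, which is a strong assumption for a 2-month backtest.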
We did have an edge after all, or so it seemed. Let’s see the uncertainty across multiple simulations:
# Create a Plotly figure
fig = go.Figure()
# Add traces for each sample's cumulative profits
for sample_data in all_samples_data:
    # Make sure to sort the sample_data by 'match_date'
    sample_data_sorted = sample_data.sort_values(by='match_date')
    fig.add_trace(go.Scatter(
        x=sample_data_sorted['match_date'],
        y=sample_data_sorted['profit'].cumsum(),
        mode='lines',
        line=dict(width=1, color='lightgrey'),
        showlegend=False
    ))
# Add a trace for the average cumulative profits per date
fig.add_trace(go.Scatter(
x=daily_profit_sum['match_date'],
y=daily_profit_sum['cumulative_profit'],
mode='lines',
name='Avg Cum. Profits',
line=dict(width=3, color='blue')
))
# Adding ROI text
fig.add_trace(go.Scatter(
x=[daily_profit_sum['match_date'].iloc[-1] + pd.DateOffset(days=4)],
y=[daily_profit_sum['cumulative_profit'].iloc[-1]],
text=[f"ROI: {roi:.2f}"], # The ROI text
mode="text",
showlegend=False,
textfont=dict( # Adjust the font properties here
size=14,
color='black',
)
))
# Update layout to add titles and make it more informative
fig.update_layout(
title="Cumulative Profits over Time with Average",
xaxis_title="Match Date",
yaxis_title="Cumulative Profit",
legend_title="Legend",
template="plotly_white",
xaxis=dict(
type='date' # Ensure that x-axis is treated as date
)
)
# Show the figure
fig.show()